Overview

This project aims at identifying predictors of health outcomes using socioeconomic factors. Specifically, income, housing costs, food security, and food costs will be tested for association with surveyed outocmes from Behavioral Risk Factor Surveillance System (BRFSS) epidemiologic data. BRFSS data is tracked at the metropolitan level, and SMART BRFSS 2017 data was downloaded as XPT file here Measures of mental and physical health will be analyzed using secondary data from the American Community Survey (ACS), Housing Urban Development (HUD) fair market rent (fmr) data, Feeding America dataset. The main faculty for this project is Blanca Himes who gave project advice and provided a reference paper (Leszinsky et al.) and Sherry Xie, who assisted with geospatial plotting and formatting.

Introduction

Describe the problem addressed, its significance, and some background to motivate the problem.

Wealth inequality has been increasing in the United States for several decades. One estimate by Congressional Budget Office (CBO) estimates a 20% increase in income inequality between 1980-2016. This is seen as increased wealth accumulation for the high-earning Americans (top 1%) who have seen their share of aggregate income increase by nearly 20% as middle-class americans saw a decreased share. Despite a steady increase in the Gross Domestic Product (GDP), real median income has not risen at a similar rate. This suggests that increased productivity in the United States has not led to increased earnings for most Americans over the past four decades. In addition to wage stagnation, the cost of living in the Country has risen as well. For most people, the largest share of expenses is taken by housing costs, accounting for 30-40% of an average American’s monthly budget. According to the Center on Budget and Policy Priorities (CBPP), median rents across the nation have risen by 13% between 2001-2008, far outpacing median household income growth. Thus, on average, a larger proportion of American incomes are being spent on necessities. Associations between income and health have been intensly studied across many economic, social policy, and epidmiologic fields. One study looked at National Health and Nutrition Examination Survey (NHANES) data which is taken from a survey conducted by the CDC. The authors investigated how survey participants responsed to questions about their general health. The study finds that those in households with high income self-reported “good heath” more frequently than those in middle to low income household. The authors also found that increased income correlates to decreased “stress load” as measured as a weighted average of 5 biomarkers associated with stress such as blood pressure and creatine levels. Overall, it is clear that an individual’s economic status strongly correlates to health outcomes. High income individuals have lower stress and better general health. Chronic stressors such as food insecurity, high costs-of-living and lack of access to preventivive/primary care all contribute to health disparities by income. Based on these findings, I wanted to combine individual demographics (including income) with aggregate data to better predict several health outcomes from the BRFSS. I wanted to see how median metropolitan statitics of poverty and income predict outcomes compared to invidiual income statistics and if combining the two increases predicitive power. Additionally, I wanted to include median rent data and food expense data, as these are two of the largest expenses in any given household. Thus, I would expect that health outcomes might vary depending on both income and cost-of-living as a result of increased financial insecurity.

This project is interdisciplinary because combines economic data with health survey data. I wanted to see if BRFSS data can be combined with metro-level cost-of-living factors to better predict health outcomes compared to individual demographics alone. I will use data from multiple sources including the CDC, BLS, HUD, ACS, and Feeding America to accomplish these goals.

Methods

Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.

For exploring growth rate, I looked at change in the consumer price index (CPI) between 2018-2020 for major metropolitan areas in the United States. This data was taken from the Bureau of Labor Statistics (BLS) To look at county-level income and poverty data, I query 5-year ACS data from 2014-2018 for poverty rate and median income. I also collected HUD fmr rate for the past from 2017-2020. Finally I use the “Map the Gap” calculated county-level meal cost from the non-profit Feeding America to estimate food expense. This cost-of-food index is calculated by applying county-level sales tax rates to Nielsen basket-of-goods based on actual pounds purchased. This “meal cost” value was last calculated in 2010 by Feeding America. Necessary corrections to nominal housing costs for inflation were corrected using CPI values from 2010-2020. BRFSS data is ta

In the first section, I explore economic factors across the country either at a metroplotian level or by county level. Using geospatioal maps, explore income, rent, and percent of rent spent on income across the country. I also looked at how the inflation rate (marker for economic growth) varies by major US cities. I also explore county-level poverty and income rates. Finally I make some plots checking the assocition between poverty and income as well as poverty and costs-of-living (rent, food expense).

Finally, I load the 2017 BRFSS dataset and clean the dataframe. I check how aggregate factors (by Metro Area) affect health outcomes (general health, COPD history, and history of depression). I also look at how individual income alone affects these same outcomes. For these analyses I use a combination of plots and Chi-squared tests to check for independence. I create a logistic regression model combining BRFSS demographic factors (age, sex, race, income) to predict these outcomes. Finally, I add aggregate “cost-of-living” data to this logistic regression model to see if predictive power improves.

Results

Download and install Rtools40 to use the BLS api: https://cran.r-project.org/bin/windows/Rtools/

Reading and Cleaning Datasets

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## Loading required package: usethis

Query CPI data from BLS

The first part of this project is to do some exploratory analyses and visualizations of some key economic and cost-of-living factors in the United States across counties or by metropolitan area.First, queried CPI data between 2018-2020 from the Bureau of Labor Statistics (BLS) to show how economic growth varies by major metropolitan areas.

This is just data-wrangling to properly pull CPI data using the BLS api package

##   year period periodName   value signature_monthly state                      area
## 1 2020    M10    October 260.892       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 2 2020    M08     August 259.336       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 3 2020    M06       June 257.942       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 4 2020    M04      April 258.978       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 5 2020    M02   February 259.121       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 6 2019    M12   December 257.847       CUURS35ESA0    MD Baltimore-Columbia-Towson
##   year period periodName   value signature_annual state                          area
## 1 2020    S01   1st Half 258.891      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 2 2019    S02   2nd Half 257.288      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 3 2019    S01   1st Half 256.485      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 4 2018    S02   2nd Half 254.382      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 5 2018    S01   1st Half 252.401      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 6 2020    S01   1st Half 244.889      CUUSS35CSA0    GA Atlanta-Sandy Springs-Roswell

These data frames are the CPI for major US metropolitan areas by month or by year from 2018-2020 for available data.

## # A tibble: 6 x 5
##   area                          year  `1st Half` `2nd Half` annual
##   <chr>                         <chr>      <dbl>      <dbl>  <dbl>
## 1 Baltimore-Columbia-Towson     2020        259.        NA    259.
## 2 Baltimore-Columbia-Towson     2019        256.       257.   257.
## 3 Baltimore-Columbia-Towson     2018        252.       254.   253.
## 4 Atlanta-Sandy Springs-Roswell 2020        245.        NA    245.
## 5 Atlanta-Sandy Springs-Roswell 2019        242.       246.   244.
## 6 Atlanta-Sandy Springs-Roswell 2018        238.       239.   239.
## # A tibble: 6 x 15
##   area                           year October August  June April February December September  July   May March January November annual
##   <chr>                         <dbl>   <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>  <dbl>
## 1 Baltimore-Columbia-Towson      2020    261.   259.  258.  259.     259.      NA         NA    NA    NA    NA      NA       NA   259.
## 2 Baltimore-Columbia-Towson      2019    258.   257.  257.  259.     254.     258.        NA    NA    NA    NA      NA       NA   257.
## 3 Baltimore-Columbia-Towson      2018    255.   255.  254.  252.     252.     253.        NA    NA    NA    NA      NA       NA   254.
## 4 Atlanta-Sandy Springs-Roswell  2020    249.   248.  245.  243.     247.      NA         NA    NA    NA    NA      NA       NA   246.
## 5 Atlanta-Sandy Springs-Roswell  2019    246.   246.  243.  243.     240.     245.        NA    NA    NA    NA      NA       NA   244.
## 6 Atlanta-Sandy Springs-Roswell  2018    239.   241.  240.  237.     237      237.        NA    NA    NA    NA      NA       NA   239.

Now let’s visualize economic growth in various metropolitan areas around the US. The average %CPI change (one measure of economic growth) is shown as a red dotted line (~2% from 2018-2020). We see that Growth varies by area. High-growth cities such as San Francisco are experiencing a high rate of inflation, beyond national average, suggesting a surge in costs-of-living. On the other hand of the spectrum, St. Louis, MO, Detroit, MI, and Houston, TX have slowed in in economic growth.

## `summarise()` ungrouping output (override with `.groups` argument)

Query Fair market rent (FMR) data from HUD

In this part of the project, I will query FMR data from the HUD. The FMR is one estimate of fair market rent for a given county. I will use this as a measure of housing costs. Althought not perfect, fmr is thorough, available for all US counties, and tracks relatively well to actual median rents in most places.

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: grid
library(stringr)
library(grid)

# function to format the geo id column of FMR dataframes
format_id <- function(df){
  
  df$ID <- str_replace(as.character(df$fips2010),
                       as.character(df$CouSub),"")
  df$ID <- as.numeric(df$ID) + 50000000000
  df$ID <- paste0("0",as.character(df$ID))
  
  return(df)
}

# read in the housing costs: fair market rates (2017-2020)
# format relevent columns into a standard form
fmr_2017 <- read.csv("FY2017_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr0", "fmr1", "fmr2", "fmr3", "fmr4"), as.numeric)
fmr_2018 <- read.csv("FY2018_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
  dplyr::rename("CouSub" = "cousub", "fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2", 
         "fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code")
fmr_2019 <- read.csv("FY2019_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
  dplyr::rename("CouSub" = "cousub","fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2", 
         "fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code") 
fmr_2020 <- read.csv("FY2020_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
  dplyr::rename("CouSub" = "cousub", "fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2", 
         "fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code") 
fmr_2021 <- read.csv("FY2021_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("rent50_0", "rent50_1", "rent50_2", "rent50_3", "rent50_4"),
            as.numeric) %>%
  dplyr::rename("CouSub" = "cousub", "fmr0" = "rent50_0",
         "fmr1" = "rent50_1","fmr2" = "rent50_2", 
         "fmr3" = "rent50_3", "fmr4" = "rent50_4",
         "Metro_code" = "cbsasub21", "areaname" = "areaname21",
         "countyname" = "cntyname") 

# read in shape file of US counties
counties <- readRDS(gzcon(url('https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds'))) # decompress  rds file and read in shape file
counties$GEO_ID <- as.character(counties$GEO_ID)#change geo id to characters in counties df
counties$ID <- gsub("US","", counties$GEO_ID)  #remove US from ID

# read in metro area shape file and convert into spatial components
setwd("tl_2017_us_cbsa")
cbsa.sf <- st_read("tl_2017_us_cbsa.shp")
## Reading layer `tl_2017_us_cbsa' from data source `C:\Users\chana\Box Sync\Graduate School\Courses\Fall 2020\BMIN 503\Final Project\BMIN503_Final_Project\tl_2017_us_cbsa\tl_2017_us_cbsa.shp' using driver `ESRI Shapefile'
## Simple feature collection with 945 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -178.4436 ymin: 17.83151 xmax: -65.42271 ymax: 65.45352
## geographic CRS: NAD83
cbsa.sf %<>%
  dplyr::select(GEOID, geometry,NAME, NAMELSAD) %>%
  dplyr::rename(ID2 = GEOID, Metro = NAMELSAD)


# function to merge FMR dataframes with county geometries 
geo_format <- function(geo_df){
  fmr_geo <- counties %>%
    dplyr::select(c("NAME", "ID", "geometry")) %>%
    left_join(geo_df[,c("ID","fmr0", "fmr1", "fmr2", "fmr3", "fmr4",
                        "Metro_code", "areaname", "countyname")], by="ID") %>%
    distinct(Metro_code, .keep_all = TRUE)
}

# function to merge FMR dataframes with metro area geometries 
geo_format2 <- function(geo_df){
  
  geo_df %<>%
    dplyr::select(ID,fmr0,fmr1,fmr2, fmr3, fmr4, Metro_code, areaname, countyname)%>%
    distinct(Metro_code, .keep_all = TRUE) %>% # removed repeated counties
    dplyr::mutate(ID2 = substr(Metro_code, 12,16)) # metro IDs
  
  fmr_geo <- cbsa.sf %>%
    dplyr::select(c("NAME", "ID2", "geometry", "Metro")) %>%
    left_join(geo_df, by="ID2") # join by metro area

}

# format geographical IDs for all FMRs 2017-20201
fmr_2017 <- format_id(fmr_2017)
fmr_2018 <- format_id(fmr_2018)
fmr_2019 <- format_id(fmr_2019)
fmr_2020 <- format_id(fmr_2020)
fmr_2021 <- format_id(fmr_2021)

# add county geographical data to all FMRs 2017-20201
fmr_2017_geo <- geo_format(fmr_2017)
fmr_2018_geo <- geo_format(fmr_2018)
fmr_2019_geo <- geo_format(fmr_2019)
fmr_2020_geo <- geo_format(fmr_2020)
fmr_2021_geo <- geo_format(fmr_2021)

# add metro area geographical data to all FMRs 2017-20201
fmr_2017_geo2 <- geo_format2(fmr_2017)
fmr_2018_geo2 <- geo_format2(fmr_2018)
fmr_2019_geo2 <- geo_format2(fmr_2019)
fmr_2020_geo2 <- geo_format2(fmr_2020)
fmr_2021_geo2 <- geo_format2(fmr_2021)

To visualize housing (rent) expenses I use the sf and ggplot packages to make static chloropleth maps for median 1-bed rent costs across US counties. This is done throuh a function which takes in the year to be visualized (fmr for 2017-2020 available). Additonally, the function adjusts the fmr to 2020 $ using the CPI between that year and the CPI for 2020. As expected, housing costs are high in major coastal cities including New York, Boston, and San Fancisco. I also made a function to plot static maps of 1-bed rents for available metropolitan area data. This data is not completely catalogued like the county-level data, as it only considers major US cities.

library(RColorBrewer)
library(leaflet)

my_theme <- function() {
  theme_minimal() +                                  # shorthand for white background color
  theme(axis.line = element_blank(),                 # further customization of theme components
        axis.text = element_blank(),                 # remove x and y axis text and labels
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  # make grid lines invisible
        legend.key.size = unit(0.8, "cm"),           # increase size of legend
        legend.text = element_text(size = 16),       # increase legend text size
        legend.title = element_text(size = 16),
        plot.title = element_text(hjust = 0.5))      # increase legend title size
}

myPalette <- colorRampPalette(brewer.pal(9, "BuPu"))    # Blue-Purple
myPalette2 <- colorRampPalette(brewer.pal(9, "YlOrRd")) # Yellow-Orange-Red
myPalette3 <- colorRampPalette(brewer.pal(9, "YlGnBu")) # Yellow-Green-Blue

# geographic visualizations of housing costs
cpi_by_year <- read.csv("annual_cpi_us.csv", fileEncoding = "UTF-8-BOM")
cpi_by_year$adjust <- cpi_by_year$cpi/258.2937 # adjustment factor into 2020 dollars

# use the cpi_by_year adjustment to adjust rents to 2010 dollars
# rent in y year = (rent in current year) X (CPI y year/CPI current year)

# Select a color palette with which to run the palette function
pal_fun  <- colorNumeric("BuPu", NULL)      # Blue-Purple from RColorBrewer
pal_fun2 <- colorNumeric("YlOrRd", NULL)    # Yellow-Orange-Red from RColorBrewer

# function to get inflation correction factor for a given year into 2020 dollars
i_factor <- function(year){
  c_factor <- cpi_by_year$adjust[which(cpi_by_year$year == year)]
  return(c_factor)
}

# function to generate static map of rent costs by year across all coutnies in US
map_rent <- function(df, year){
    correction = i_factor(year)
 
  fmr_map = ggplot() +
    geom_sf(data=df,lwd=0, 
            aes(fill = fmr1/correction)) + # adjust factor included
    ggtitle(paste(year, "Median Rent for 1-bed")) + 
    scale_fill_gradientn(name = "Rent (2020 $)", 
                         colours = myPalette2(100),
                         limits = c(0,2000),
                         breaks = c(500,1000,1500,2000)) + # add limits and breaks to normalize scaling between years  
    north(data = df, symbol = 12, location = "bottomright") +
    scalebar(data = df, dist = 500, dist_unit = 'mi', transform = TRUE,
             location = "bottomleft",
             st.dist = 0.1) +
    my_theme()
}

# call static map function and display
fmr1_2017_map <- map_rent(fmr_2017_geo, 2017) # generate static rent map for counties
fmr1_2017_map # display rent map

Next I create an interative map of US median rents using the R leaflet package and the same HUDfmr data. Like the previous section, this visualization can plot data for FMRs between 2017-2020 and adjust prices to 2020 dollars.

# an interactive map of rent prices
library(htmlwidgets)
library(htmltools)

pal_fun <- colorNumeric("YlOrRd", NULL)  # Yellow-Orange-Red from RColorBrewer


# ------------------------------------------------- #
# HTML code to fix misaligned "na" in map legend
# ------------------------------------------------- #
library(htmlwidgets)
css_fix <- "div.info.legend.leaflet-control br {clear: both;}" # CSS to correct spacing
html_fix <- htmltools::tags$style(type = "text/css", css_fix)  # Convert CSS to HTML


# Funciton allows to plot interactive leaflet maps of median US rents by year
map_rent_i <- function(df, year){

#---------------------------------------------------#
# HTML code to add title to interactive maps
#---------------------------------------------------#
tag.map.title <- tags$style(HTML("
.leaflet-control.map-title { 
    transform: translate(-50%,20%);
    position: fixed !important;
    left: 50%;
    text-align: center;
    padding-left: 10px; 
    padding-right: 10px; 
    background: rgba(255,255,255,0.75);
    font-weight: bold;
    color: #000000;
    font-size: 20px;
  }
"))

title <- tags$div(
  tag.map.title, HTML(paste(year, "Median Rent:", 1, "Bed"))
)  
    
  correction = i_factor(year)# find correction factor
  
  # Pop-up message
  pu_message <- paste0("County: ", df$countyname,  
                     "<br>Rent: ", "$", floor(df$fmr1/correction)) # in 2010 dollars
  
  # Adding more customization 
  interactive_map <- leaflet(df) %>%
    addPolygons(stroke = FALSE,                        # remove polygon borders
                fillColor = ~pal_fun(fmr1/correction), # in 2010 dollars
                fillOpacity = 0.5, smoothFactor = 0.5, # increase opacity and resolution
                popup = ~pu_message) %>%
    addProviderTiles(providers$CartoDB.Positron) %>%   # add third party provider tile
    #addProviderTiles(providers$Stamen.Toner) %>%
    #addProviderTiles(providers$Esri.NatGeoWorldMap)
    addLegend("bottomright",                           # location of legend
              pal=pal_fun,                             # palette function
              values=~fmr1/correction,                  # variable to be passed to palette function
            title = paste("Rent",1,"bed (2020 $)"),  # legend title
            opacity = 1) %>% # legend opacity (1 = completely opaque)
    addScaleBar() %>%
    addTiles() %>%
    addControl(title, position = "topleft", className="map-title") #add title panel
  
  return(interactive_map %<>% htmlwidgets::prependContent(html_fix))
}

# Call interactive map function and display 
fmr_2020_mapi <- map_rent_i(fmr_2020_geo, 2020) 
fmr_2020_mapi

ACS poverty and income data

In this part of the project, I gather 5-year American Community Survey (ACS) data from the US census bureau. Namely I am looking for median household income, and poverty rates (calculated as the percent of families falling below poverty threshold over total families surveyed).

## Getting data from the 2014-2018 5-year ACS

I make another static map of median income across the US counties using 5-year ACS data (2014-2018). It is clear that high income is concentrated in coastal cities as expected.

In this section, I wanted to see if ACS data and fmr data correlate with each other, and how these factors may be used independently or together in future analyses of health outcomes.

Here I explore the relationship between poverty rates and median household income as well as median 1-bed rents. Poverty tracks non-linearly with median household income but not as well with median fmr. I then looked at FMR as a percent of median income to see if the amount of one’s budget spent on rent might be a better indicator for poverty. In those ggplots, a linear regression fits well, althought it remains unclear to what extend housing costs play into the regression versus income alone.

# calculate poverty rate and percent spent on housing by county for one year

thresh <- 0 # threshold for filtering poverty estimates

acs.fmr.2017 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2018 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2019 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2020 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)


# plot map of percentage spent on median 1-bed apartment by median income
rent_income_map <- ggplot() +
  geom_sf(data=acs.fmr.2017,lwd=0, 
          aes(fill = housing_percent)) + # adjust factor included
  ggtitle("Percent of income spent on rent (2017 1-Bed)") + 
  scale_fill_gradientn(name = "% spent on rent", 
                       colours = myPalette2(10)) + # add limits and breaks to normalize scaling between years  
  north(data = acs.fmr.2017, location = "bottomright", symbol = 12) +
  scalebar(data = acs.fmr.2017, dist = 500, dist_unit = 'mi', transform = TRUE,
           location = "bottomleft",
           st.dist = 0.1) +
  my_theme()

rent_income_map # show the median income map

mapthegap data

## Warning: NAs introduced by coercion

The final piece of cost-of-living data to be used in this project is food expense data. I downloaded “meal cost” data from Feeding America, a non profit organization studying food costs and food insecurity in the United states. The “price per meal” value was calculated by the organization in 2010 using the local price of a basket of goods defined by the FDA and adjusted for local tax rates. Again, I checked to see if price-per-meal was correlated with poverty rates alone. The ggplots show they do not.

Finally, I combined all housing costs and meal costs as a percentage of median income, plotted them along the poverty rate by county (2017-2020), and fit a linear regression to each plot. There seems to be a predictable linear relationship between these factors.

# generate a two-factor cost of living index by adding food and housing costs

acs.fmr.food.2017 %<>%
  mutate(COL = meal_percent + housing_percent)

acs.fmr.food.2018 %<>%
  mutate(COL = meal_percent + housing_percent)

acs.fmr.food.2019 %<>%
  mutate(COL = meal_percent + housing_percent)

acs.fmr.food.2020 %<>%
  mutate(COL = meal_percent + housing_percent)

COL_2017 <- ggplot(acs.fmr.food.2017, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2017 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

COL_2018 <- ggplot(acs.fmr.food.2018, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2018 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

COL_2019 <- ggplot(acs.fmr.food.2019, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2019 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

COL_2020 <- ggplot(acs.fmr.food.2020, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2020 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

combined_COL <- cowplot::plot_grid(COL_2017,COL_2018,COL_2019,COL_2020) 

combined_COL # display the combined graph

Next, I make visualization to see the contribution of such costs alone verses median income. Based on these analyses, poverty rate and income are directly correlated as expected. However, median housing cost and food expenses alone do not seem to be directly associated with the poverty rate. Furthermore, calculated variables such as percent-spent-on-housing and percent-spent-on-food do not seem to give better correlations to poverty rate compared to median income alone, From these results I will use housing costs, food costs, and either poverty rate or median income from aggreate county data to predict health outcomes from BRFSS data.

## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
## 
##     mutate
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise, summarize

Get BRFSS data

Finally, I downloaded the BRFSS dataset from 2017 to analyze some health outcomes based on this phone survey.

Data-wrangling to rename, filter, and refactor variables.

# select variables of interest
# clean BRFSS 2017 dataframe

brfss2017 <- brfss2017_raw %>%
  dplyr::select('MMSANAME', 'X_MMSA', 'X_RACE', 'SEX', 'X_AGEG5YR','EDUCA','INCOME2',
                'ADDEPEV2', 'CHCCOPD1', 'X_RFHLTH') %>%
  dplyr::rename('Metro Area'='MMSANAME','Metro_ID'='X_MMSA','age'='X_AGEG5YR','race'='X_RACE', 
         'sex'='SEX', 'education'='EDUCA','income'='INCOME2',
         'depression'='ADDEPEV2','copd'='CHCCOPD1', 'health'='X_RFHLTH') %>%
  dplyr::filter(race %in%  c(1,2,3,5,8)) %>%
  dplyr::filter(sex %in% c(1,2)) %>%
  dplyr::filter(age %in% seq(13)) %>%
  dplyr::filter(education %in%  c(1,2,3,4,5,6)) %>%
  dplyr::filter(income %in% c(1,2,3,4,5,6,7,8)) %>%
  dplyr::filter(depression %in%  c(1,2)) %>%
  dplyr::filter(copd %in% c(1,2)) %>%
  dplyr::filter(health %in% c(1,2))

brfss2017 %<>%
  dplyr::mutate('Metro_ID' = as.character(brfss2017$'Metro_ID')) %>%
  dplyr::mutate(race = factor(race, levels = c(1,2,3,5,8), 
                        labels = c( "White", "Black", "American Indian", 
                                    "Asian", "Hispanic"))) %>%
  dplyr::mutate(sex = factor(sex, levels = c(1,2), 
                        labels = c("Male", "Female"))) %>%
  
  dplyr::mutate(age = factor(age, levels = seq(13), 
                        labels = c("18-24","25-29","30-34","35-39","40-44","45-49","50-54",
                                   "55-59","60-64","65-69","70-74","75-29",">79"))) %>%
  dplyr::mutate(education = factor(education, levels = c(1,2,3,4,5,6), 
                        labels = c("None","Elementary", "Some HS", "HS",
                                   "Some College", "College"))) %>%
  dplyr::mutate(income = factor(income, levels = c(1,2,3,4,5,6,7,8), 
                        labels = c("10k","15k", "20k", "25k", "35k", "50k", "75k","> 75k"))) %>%
  dplyr::mutate(depression = factor(depression, levels = c(1,2), 
                        labels = c("yes", "no"))) %>%
  dplyr::mutate(copd = factor(copd, levels = c(1,2), 
                        labels = c("yes", "no"))) %>%
  dplyr::mutate(health = factor(health, levels = c(1,2), 
                        labels = c("good", "poor"))) 

Let’s look at the summary of the clean 2017 data frame. I am interested in sex, age, income, education, and race as demographic factors. I am interested in previous depression, COPD, and general health as outcomes. From the summary, there is a bias in sex and age. This might be explained by the fact that BRFSS is a phone based survey. It seems that income and education is skewed twoards highly educated, high earning individuals as well. The BRFSS might be biased towards home-owners and landline users over cell-phone users.

##   Metro Area          Metro_ID                      race            sex           age               education       income      depression    copd         health      
##  Length:176937      Length:176937      White          :138185   Male  :79948   18-24: 9372   None        :  210   10k  : 7348   yes: 35853   yes: 13249   good:146650  
##  Class :character   Class :character   Black          : 18934   Female:96989   25-29: 9552   Elementary  : 3202   15k  : 7613   no :141084   no :163688   poor: 30287  
##  Mode  :character   Mode  :character   American Indian:  1806                  30-34:10675   Some HS     : 6744   20k  :11613                                          
##                                        Asian          :   385                  35-39:11531   HS          :41713   25k  :14631                                          
##                                        Hispanic       : 17627                  40-44:11104   Some College:48317   35k  :17058                                          
##                                                                                45-49:13113   College     :76751   50k  :23933                                          
##                                                                                50-54:15770                        75k  :27922                                          
##                                                                                55-59:18469                        > 75k:66819                                          
##                                                                                60-64:19574                                                                             
##                                                                                65-69:19402                                                                             
##                                                                                70-74:15710                                                                             
##                                                                                75-29:10526                                                                             
##                                                                                >79  :12139

Next, I had to apply the appropriate metro code to previous county-level data for cost-of-living variabes. This is done by assigning all counties associated with one metropolitan area the same metro code, Lumping together multiple US counties into one geographical designation. This must be done because the BRFSS does not record which county participants live in, but rather their metro location.

Data Wrangling to apply an appropriate metro code to every county

Combine food, housing, income, and poverty data into one data frame by metro area rather than by county

Next, I looked at how the cost-of-living and aggregate income variables affected the outcomes of three chosen BRFSS health outcomes (history of depression, history of COPD, general health)

Look at how aggregate costs/income data and individual income data affect health outcomes

Rent effects on health outcomes

aggregate food expenses and health outcomes

Aggregate poverty effect on health outcomes

Unfortunately, in aggregate there does not seem to be a difference in case-control status for any of the chosen health outcomes based on secondary factors including rent, food expense, and median income. This is shown in above boxplots.

Next, I looked at how individual income associates with the same health outcomes.

Now we will run chi-square tests to see if income is indepdent from chosen

health outcomes.

##       
##          10k   15k   20k   25k   35k   50k   75k > 75k
##   good  3024  3345  5988  8708 10974 16542 20359 47235
##   poor  2467  2603  3159  3343  3118  3340  2690  3325
## 
##  Pearson's Chi-squared test
## 
## data:  brfss2017_health_table
## X-squared = 13558, df = 7, p-value < 0.00000000000000022
##      
##         10k   15k   20k   25k   35k   50k   75k > 75k
##   yes  2083  2196  2635  3141  3157  4095  4466  7479
##   no   3408  3752  6512  8910 10935 15787 18583 43081
## 
##  Pearson's Chi-squared test
## 
## data:  brfss2017_depression_table
## X-squared = 3625.9, df = 7, p-value < 0.00000000000000022
##      
##         10k   15k   20k   25k   35k   50k   75k > 75k
##   yes   988  1234  1360  1492  1374  1630  1343  1553
##   no   4503  4714  7787 10559 12718 18252 21706 49007
## 
##  Pearson's Chi-squared test
## 
## data:  brfss2017_COPD_table
## X-squared = 4926.9, df = 7, p-value < 0.00000000000000022

Based on visualizaitons and chi-squared tests, it seems that there is an association between income and general health and history of both COPD and depression. This is expected based on previous work. Indivdiduals from high earning families are more likely to report better health, have less instances of COPD as well as depression.

Next, I looked to create explore these health outcomes (depression, COPD, general health) using all individual demographic factors only by logistic regression.

Check model accuracy using individual demographics alone. Overall, there is decent predictive power using individual demographic data (including income) to all health outcomes. The AUC for depression, general health, and COPD are 0.6702, 0.7512, and 0.7597 respectively.

Finally, I explored these health outcomes (depression, COPD, general health) using both individual demographic factors and aggregate metropolitan factors by logistic regression. I hoped to add predictive power to the model by combining both individual data as well as aggregate secondary data.

check model accuracy using individual demographics and aggregate metro data

AUC barely improved (<0.01 increase) with the addition of secondary cost-of-living data.

##         1         2         3         4         5         6 
## 0.7964293 0.8271381 0.6311824 0.5844652 0.8867128 0.8109481

Adding in the additional variables from ACS median income an housing costs only slightly increased AUC, but not enough to merit including them into GLM models. Overall, the value of individual survey data is much more valuable in predicting history of depression, COPD, and general well-being/health.

Overall, this might because the the aggregate data is not granular enough. In order to do this analysis, multiple counties had to be reassigned to one metro label. However, cost-of-living expenses can vary greatly depending not only on which city one is located in but also which county and which part of the county one lives in. Thus, averaging housing and food expenses from multiple geographical areas does not provide fine enough detail to how one’s indiviual income might be affect by cost of living. In order to better improve these models, surveys should ask individuals for their housing costs or provide more granular geospatial labelling for individuals.

From this project, I have recapitulated previous works describing the associations with increased income and better health outcomes. In addition, I show that ACS/FMR data at the metro-area level is not a good predictor of BRFSS outcomes chosen. While it is thought that having high cost-of-living expenses might negatively health outcomes (due to stress, finanical and food insecurity, and lack of access to adequate primary/preventive healthcare), the secondary data sets could not be adequately applied to the BRFSS dataset to answer these questions.